------------------------------------------------------------------------------------
Where are the processed forms?
------------------------------------------------------------------------------------

All the processed forms are in the FormsFullSample folder, and are
described in the next section.

The folders FormsSubsample1-4 contain exactly the same files as
FormsFullSample, but restricted to participants assigned to subsamples
1-4, balanced by the date when they entered the study, and then by
SEX, AGE, EDUCATION, and INCOME. The NIH HV and Patients were added
to subsample 1 after this process, so it is slightly larger than the others.
If doing predictive modelling, I'd suggest using 1+2 for model
development, 3 for validation or model development, and 4 for held-out testing.

------------------------------------------------------------------------------------
Which forms are there?
------------------------------------------------------------------------------------

The names used for the forms in CTDB are the following:

- baseline
	'AUDIT'
	'DSMXC'
	'FIGS'
	'KESSLER5'
	'WHODAS'
	'drug_use'
	'clinical_history'
	'demographics'
- end of study
	'chronic_pain'
	'emotional_support'
	'everyday_discrimination'
	'motivation'
	'social_support'
	'personality'
	'clinical_form_end_of_study'
	'end_of_study'
	'BRS'
	'BTQ'
- recurring
	'COVID-19-R'
	'KESSLER5'
	'DSMXC'

Note that recurring forms are available at both baseline *and* end of study periods.

The processed files have the following naming convention:

    final-<FORM NAME>-baseline.csv - baseline data, all participants in this sample
    final-<FORM NAME>-baseline-NIH-participants.csv - baseline, just the NIH participants

    final-<FORM NAME>-end_of_study.csv - end of study, all participants in this sample
    final-<FORM NAME>-end_of_study-NIH-participants.csv - end of study, just the NIH participants

    final-<FORM NAME>-all_time_points.csv - recurring forms, all participants, all time points (includes baseline and end of study)
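
As a rough illustration of the naming convention above, the file name for a given form and period could be built as follows. The function name and its arguments are made up for this example and are not part of the processing code; it also assumes there is no NIH-only variant of the all_time_points files, since none is listed.

```python
def processed_filename(form, period, nih_only=False):
    """Build a processed-file name following the convention listed above.

    `processed_filename` is a hypothetical helper for illustration only.
    """
    if period not in ("baseline", "end_of_study", "all_time_points"):
        raise ValueError(f"unknown period: {period}")
    if nih_only and period == "all_time_points":
        # Assumption: no NIH-only file is listed for all_time_points.
        raise ValueError("no NIH-only file is listed for all_time_points")
    suffix = "-NIH-participants" if nih_only else ""
    return f"final-{form}-{period}{suffix}.csv"


print(processed_filename("AUDIT", "baseline"))
print(processed_filename("BTQ", "end_of_study", nih_only=True))
```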

------------------------------------------------------------------------------------
New forms were derived from the recurring ones
------------------------------------------------------------------------------------

In addition to the recurring forms, we derived others from them:

1) final-derived_outcomes-all_time_points.csv

This contains new variables derived from recurring ones:
- KESSLER5_TOTAL - sum of all KESSLER5 items
- UCLA_LONELINESS - sum of the three items in the COVID-19-R form corresponding to the UCLA loneliness scale
- DSMXC_FACTOR1-6 - the six factors from the DSMXC model (Cristan provided this)
- DSMXC_BIFAC_G/1/2 - the three factors from the DSMXC bifactor model (Cristan provided this)

2) separate parts of the COVID-19-R form for all time points, containing subsets of the questions:

- circumstances - final-COVID-19-R-circumstances-all_time_points.csv
- behavioral outcomes - final-COVID-19-R-outcomes_behavioral-all_time_points.csv
- psychological outcomes - final-COVID-19-R-outcomes_psychological-all_time_points.csv
- loneliness items - final-COVID-19-R-loneliness-all_time_points.csv
- free text - final-COVID-19-R-text-all_time_points.csv

3) the variables making up the NIEHS Pandemic Vulnerability Index for the participant zip code

https://www.niehs.nih.gov/research/programs/coronavirus/covid19pvi/details/
https://github.com/COVID19PVI/data/blob/master/PVI_codebook.pdf

- ZIP code (first three digits)

There are two current models (11.2 and 12.4) used to compute PVI scores. The main difference between 11.2 and 12.4 is the inclusion of vaccine data and ventilators (and different re-weighting) in model 12.4. In our case, we extracted PVI scores from Model 11.2 since it contains data points for all the dates of the survey. Below we describe the PVI score variables: 

- ToxPi_Score (represents the COVID-19 Pandemic Vulnerability Index (PVI); derived from the below 12 key indicators; values of ToxPi_Score and other indicators range between [0,1]). 

- Infection_Rate_Transmissible_Cases (represents the number of contagious individuals relative to the population, computed as population size divided by cases from the last 14 days; greater number indicates more likely continued spread of disease).
- Infection_Rate_Disease_Spread (fraction of total cases that are from the last 14 days; with values near 1 during exponential growth phase, and declining to zero over 14 days if there are no new infections).
- Pop_Concentration_Pop_Mobility (measured by the "Daytime Population Density" and "Baseline Traffic"; higher values are associated with higher spread of infection because more people are in closer proximity to each other).
- Pop_Concentration_Residential_Density (data from the American Community Survey (ACS) on families in multi-unit structures, mobile homes, among others; higher values are associated with higher spread of infection because more people are in closer proximity to each other).
- Intervention_Social_Distancing (change in overall distance travelled and the change in nonessential visits relative to baseline (previous year), based on cell phone mobility data; higher values (less social distancing) are associated with higher spread of infection).
- Intervention_Testing (population divided by tests performed; greater numbers indicate less testing which might increase the spread of infection).
- Health_&_Environment_Pop_Demographics (% Black and % Native from CHR (County Health Rankings and Roadmaps)).
- Health_&_Environment_Air_Pollution (average daily density of fine particulate matter in micrograms per cubic meter (PM2.5) from 2014 Environmental Public Health Tracking Network; higher values are associated with more severe outcomes from COVID-19 infection). 
- Health_&_Environment_Age_Distribution (% age 65 or older; older people have been associated with more severe outcomes from COVID-19 infection).
- Health_&_Environment_Co-morbidities (premature death, smoking, diabetes, and obesity; higher values are associated with more severe outcomes from COVID-19 infection). 
- Health_&_Environment_Health_Disparities (Uninsured and SVI Socioeconomic Status data from ACS; higher values are more likely to be undercounted in infection statistics, and may have more severe outcomes due to lack of treatment). 
- Health_&_Environment_Hospital_Beds (summation of hospital beds; it is a static value, computed as # beds / CHR population; lower values (fewer hospital beds) might be associated with more severe outcomes from COVID-19 infection).

as well as two fields from model 12.4 that were added on Jan 13, 2021 (so may be blank for many participants/dates). Of note, these two fields were not considered for the ToxPi_Score we extracted (from model 11.2). 

- Health_&_Environment_Hospital_Ventilators (% of ventilators in use; this is the % of ventilators across all medical facilities that are being used by patients for any medical condition)
- Intervention_Vaccines (% of unvaccinated residents; disease spread will be reduced with fewer unvaccinated residents).


Note that, for all of the forms 1/2/3:

- every "missing" time point has been added, with NAs if subject did not respond
   e.g. if a participant fills a survey at baseline, and then only on
   week 10, every other week (2,4,6,8,12,etc) would be there

- we created additional time-related fields:
  - VISIT_ABSOLUTE_DAY - days elapsed since the beginning of the study on 4/4/2020, 0-indexed
  - VISIT_WEEK - numeric field with the # of the week for that participant (0-24)
  - RESPONDED - 1 if the participant filled this form at that time point, 0 otherwise
     (if 0, the time point is filled with NAs)
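
The padding of missing time points and the time-related fields can be sketched roughly as below. This is not the actual processing code: the function names, the representation of a row as a dict, the use of None for NA, and the set of expected weeks passed in by the caller are all assumptions made for the example.

```python
from datetime import date

STUDY_START = date(2020, 4, 4)  # beginning of the study, as stated above


def visit_absolute_day(visit_date):
    """Days elapsed since the study start, 0-indexed."""
    return (visit_date - STUDY_START).days


def add_missing_time_points(rows, expected_weeks, fields):
    """Return one row per expected week for a single participant.

    rows: dicts that each contain a VISIT_WEEK key (weeks actually answered).
    expected_weeks: iterable of week numbers that should exist (assumed input).
    fields: names of the form fields to fill with None when there is no response.
    """
    by_week = {row["VISIT_WEEK"]: row for row in rows}
    filled = []
    for week in expected_weeks:
        if week in by_week:
            filled.append({**by_week[week], "RESPONDED": 1})
        else:
            empty = {name: None for name in fields}
            filled.append({**empty, "VISIT_WEEK": week, "RESPONDED": 0})
    return filled


answered = [
    {"VISIT_WEEK": 0, "KESSLER5_TOTAL": 3},
    {"VISIT_WEEK": 10, "KESSLER5_TOTAL": 5},
]
padded = add_missing_time_points(answered, range(0, 13, 2), ["KESSLER5_TOTAL"])
```

In this example the participant responded at weeks 0 and 10 only, so the rows for weeks 2, 4, 6, 8 and 12 are created with RESPONDED = 0 and an empty KESSLER5_TOTAL.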

------------------------------------------------------------------------------------
What have you done to the forms? 
------------------------------------------------------------------------------------

Across all forms, we

- dropped all the text fields (e.g. all those named *_SPFY)
- combined two lines of headers into a single one
- added a SUBJECT_STATUS field (where available), indicating if the participant is an NIH HV or Patient

In addition, for the "all_time_points" forms

- every "missing" time point has been added, with NAs in most fields
- we created additional time-related fields:
  - VISIT_ABSOLUTE_DAY - days elapsed since the beginning of the study on 4/4/2020, 0-indexed
  - VISIT_WEEK - numeric field with the # of the week for that participant (0-24)
  - RESPONDED - 1 if that time point is present, 0 otherwise

Per form:

1) clinical_history

- the complex encoding of ['HOSPITALIZATION_TREATMENT','HOSPITALIZATION_TREATMENT_2','HOSPITALIZATION_TREATMENT_3','HOSPITALIZATION_TREATMENT_4','HOSPITALIZATION_TREATMENT_5','HOSPITALIZATION_TREATMENT_6'] is converted into 0/1 indicators
  HOSPITALIZATION_MEDICAL
  HOSPITALIZATION_MENTAL_HEALTH
  TREATMENT_ALCOHOL_DRUG
  COUNSELING_MENTAL_HEALTH
  TREATMENT_MENTAL_HEALTH
  NO_CLINICAL_HISTORY

- other complex encodings for a particular health problem (e.g. ['CLIN_HX_CANCER','CLIN_HX_CANCER_2','CLIN_HX_CANCER_3','CLIN_HX_CANCER_4','CLIN_HX_CANCER_5','CLIN_HX_CANCER_6']) are converted into separate self, children, and family indicators (e.g. CLINICAL_CANCER_SELF, CLINICAL_CANCER_CHILDREN, CLINICAL_CANCER_FAMILY)

2) demographics

- SUBJECT_SUBSAMPLE2 is 1 or 2, indicating which subsample (exploratory or confirmatory) a subject belongs to. Sampling was done by stratifying by SEX, AGE, EDUCATION, and INCOME, and assigning subjects to subsamples 1 or 2 after that. NIH HV and Patients were included in group 1.
- SUBJECT_SUBSAMPLE4 is the same, but with 4 groups rather than 2 (for machine learning analyses, we might want 2 for training, 1 for validation, 1 for final testing)
- ZIP code is converted to 3 digits (if it is made of digits) or NA otherwise
- COUNTRY_OUTSIDE_US is replaced by IS_IN_US indicator (as many hundreds of people specify variants of USA, or county names)
- GENDER (can only take M/F/? values) is converted to SEX_MALE / SEX_FEMALE 0/1/? indicators
- LGBT_IDENTITY_* coding is converted to 0/1/? indicators
- RACE_* coding is converted to 0/1/? indicators
- ETHNICITY_1 coding is converted to ETHNICITY_NOTLATINO, ETHNICITY_LATINO, ETHNICITY_UNKNOWN 0/1 indicators
- SETTING_* coding is converted to SETTING_URBAN, SETTING_RURAL, and SETTING_SUBURBAN indicators
- TRANSPORTATION_* coding is replaced by indicators for each value
- MARITAL_STATUS_* coding is replaced by indicators for each value
- EMPLOYMENT_* coding is replaced by indicators for each value
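
As an example of one of the conversions above, the ZIP code rule (keep the first three digits if the value is made of digits, NA otherwise) could look roughly like this. The function name is hypothetical, None stands in for NA, and treating values shorter than three digits as NA is an assumption not stated in this README.

```python
def zip_to_three_digits(raw):
    """Return the first three digits of a ZIP code, or None (NA) if it is not all digits."""
    if raw is None:
        return None
    text = str(raw).strip()
    # isascii() guards against non-ASCII digit characters that isdigit() accepts.
    if not (text.isascii() and text.isdigit()):
        return None
    if len(text) < 3:
        # Assumption: too short to yield a 3-digit prefix, so treat as NA.
        return None
    return text[:3]


print(zip_to_three_digits("20892"))
print(zip_to_three_digits("2089a"))
```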

3) FIGS

- the various family member fields (e.g. NERVES_EMOTIONS_MOTHER, NERVES_EMOTIONS_FATHER,...) are combined into:
  *_YOURSELF - whether or not the respondent indicated a positive response (e.g., NERVES_YOURSELF)
  *_RELATIVES - whether or not any first-order relatives have the condition (e.g., NERVES_RELATIVES)

4) WHODAS

- WHODAS fields are included individually (but WHODAS_S* fields can be added into one)

5) COVID-19

- combined into COVID-19-R, with all missing fields marked as NA

6) end_of_study

- added BMI field (easier to use as single measure of obesity), as well as WEIGHT_KG and HEIGHT_M
- replaced WEIGHT_CHANGE by indicators for each direction WEIGHT_INCREASE/SAME/DECREASE
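
For reference, BMI is weight in kilograms divided by the square of height in metres. A minimal sketch of the derived fields, assuming the raw inputs are weight in pounds and height in inches (the function name and the use of None for missing values are also assumptions):

```python
LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms
IN_TO_M = 0.0254       # exact definition of the inch in metres


def bmi_fields(weight_lbs, height_inches):
    """Derive WEIGHT_KG, HEIGHT_M and BMI; any field that cannot be computed is None."""
    weight_kg = weight_lbs * LB_TO_KG if weight_lbs is not None else None
    height_m = height_inches * IN_TO_M if height_inches is not None else None
    if weight_kg is None or height_m is None or height_m <= 0:
        bmi = None
    else:
        bmi = weight_kg / height_m ** 2
    return {"WEIGHT_KG": weight_kg, "HEIGHT_M": height_m, "BMI": bmi}


example = bmi_fields(154, 66)
```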

7) social_support

- renamed fields from SOC* to specific questions
- subtracted 1 from all scales so that "never" is 0, as is in the other forms

8) emotional_support

- renamed fields from SOC* to specific questions
- subtracted 1 from all scales so that "never" is 0, as is in the other forms

9) personality

- renamed fields from BFIN* to specific questions
------------------------------------------------------------------------------------
How did you impute missing values as zero?
------------------------------------------------------------------------------------
We used two “types” of zeroing (toggled with the variables shown below).

zero_missing_if_response_q_set: For a given set of related questions (e.g., "check the box corresponding to your gender identity"), recode non-responses within the question set as 0. For example, if a respondent designates their gender as male but does not check the "trans male" box, encode "trans male" as 0. If a respondent doesn't answer any of the questions in a given set of related questions, we leave these responses as missing.

This was used for the following question sets:
Demographics form
- CURRENT_GENDER_*
- LGBT_IDENTITY_*
- RACE_*
- ETHNICITY_*
- EMPLOYMENT_*
- PARTICIPATE_STUDY_*
COVID-19 (both COVID-19-R and COVID-19-previous, with minor adjustments for form question differences):
- COVID_19_ADULT_LIVE_HOUSE_*
- Information consumption (COVID_19_ADULT_READ_SOCIAL_MEDIA, …, COVID_19_ADULT_INFORMATION_OTHR)
- Household changes (COVID_19_ADULT_LOST_JOB, …, COVID_19_ADULT_HAPPEN_NONE)
- COVID-19 symptoms (COVID_19_ADULT_NO_SYMPTOMS, …, COVID_19_ADULT_SYMPTOMS_OTHER)
- COVID_19_ADULT_EXERCISE_*
- COVID_19_ADULT_MNDFLNSS_*
- COVID_19_ADULT_HOBBY_*
Clinical form
- Clinical conditions (ASTHMA, …, OTHER_HEALTH_COND, NONE)
- Outpatient/virtual medical care (OUTPATIENT_MED_CARE, …, OUTPATIENT_MED_CARE_OTHER)
- Inpatient medical care (INPATIENT_MED_HOSP, …, INPATIENT_SERVICE_OTHER)
Everyday discrimination
- FREQ_*
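
The zero_missing_if_response_q_set rule described above can be sketched as follows. This is an illustration, not the actual code: a respondent is represented as a dict, None stands for a missing response, and the column names in the example (RACE_A, RACE_B, RACE_C) are placeholders rather than real field names.

```python
def zero_missing_if_response_q_set(row, question_set):
    """Within one set of related questions, recode missing answers as 0,
    but only if the respondent answered at least one question in the set."""
    out = dict(row)
    answered_any = any(out.get(q) is not None for q in question_set)
    if not answered_any:
        return out  # the whole set stays missing
    for q in question_set:
        if out.get(q) is None:
            out[q] = 0
    return out


race_set = ["RACE_A", "RACE_B", "RACE_C"]  # placeholder column names
partial = zero_missing_if_response_q_set({"RACE_A": 1, "RACE_B": None, "RACE_C": None}, race_set)
skipped = zero_missing_if_response_q_set({"RACE_A": None, "RACE_B": None, "RACE_C": None}, race_set)
```

Here `partial` has the two unchecked boxes set to 0, while `skipped` is left entirely missing because no question in the set was answered.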

zero_missing_if_logically_implied: There are a couple of sets of questions for which missing responses should necessarily be coded as zero, given the answer to a prior question. For example, if a respondent indicates that they have never been in a war zone, and a later question asks whether or not they have been injured in a war zone, the answer to the latter question is necessarily "no." So, for these types of question sets, we recode missing responses that are logically false as 0 (pseudocode, e.g.: if ever in war zone == 0, then injured in war zone = 0).

This was used for the following sets of questions (“if Q1 == 0, then Q2, Q3, … = 0 if Q2, Q3, … are missing” noted as Q1 -> Q2, Q3, … for concision):
FIGS
- ALCOHOL_CAUSE_PROBLEM -> ALCOHOL_TREATMENT
- DRUG_PROB -> DRUG_PROB_TREATMENT
Clinical form
- COVID_19_TEST -> COVID_19_POS_TEST
Chronic pain
- PAIN_CHRONIC -> PAIN_CHRONIC_COND_PRESENT_6_MONTH
BTQ
- BTQ_WAR_ZONE -> BTQ_WAR_ZONE_LIFE_DANGER, BTQ_WAR_ZONE_INJURED
- BTQ_SERIOUS_ACCIDENT -> BTQ_SERIOUS_ACCIDENT_LIFE_DANGER, BTQ_SERIOUS_ACCIDENT_INJURED
- BTQ_MAJOR_NATURAL_TECH_DIASTER -> BTQ_MAJOR_NATURAL_DISASTER_LIFE_DANGER, BTQ_MAJOR_NATURAL_DISASTER_INJURED
- BTQ_LIFE_THREATENING_ILLNESS -> BTQ_LIFE_THREATENING_ILLNESS_DANGER
- BTQ_PHYSICAL_PUNISH_PARENT_BEFORE_18 -> BTQ_PHYSICAL_PUNSH_PRNT_BFRE_18_LFE_DNGR, BTQ_PHYSICAL_PUNSH_PRNT_BFRE_18_INJURED
- BTQ_ATTACK_BY_STRANGER_FRIEND -> BTQ_ATTACKED_STRANGER_FRIEND_LIFE_DANGER, BTQ_ATTACKED_STRANGER_FRIEND_INJURED
- BTQ_UNWANTED_SEXUAL_CONTACT -> BTQ_UNWANTED_SEXUAL_CONTACT_LIFE_DANGER, BTQ_UNWANTED_SEXUAL_CONTACT_INJURED
- BTQ_SERIOUS_INJURED_KILLED -> BTQ_FEAR_KILLED_SERIOUSLY_INJURED
- BTQ_FAMILY_DIED_VIOLENTLY -> BTQ_FAMILY_DIED_VIOLENTLY_INJURED
AUDIT
- AUDIT_ALCOHOL_FREQ_DRINK -> AUDIT_ALCOHOL_PER_DAY, …, AUDIT_REMEMBR_NIGHT_PRIOR_FREQ
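
A minimal sketch of the Q1 -> Q2, Q3, … rule, using the BTQ war zone fields listed above. The function signature and the dict/None representation are assumptions made for the example, not the actual implementation.

```python
def zero_missing_if_logically_implied(row, gate, dependents):
    """If the gating question is 0, set any missing dependent question to 0.

    Dependent questions that already have an answer are left untouched.
    """
    out = dict(row)
    if out.get(gate) == 0:
        for q in dependents:
            if out.get(q) is None:
                out[q] = 0
    return out


respondent = {
    "BTQ_WAR_ZONE": 0,
    "BTQ_WAR_ZONE_LIFE_DANGER": None,
    "BTQ_WAR_ZONE_INJURED": None,
}
fixed = zero_missing_if_logically_implied(
    respondent,
    "BTQ_WAR_ZONE",
    ["BTQ_WAR_ZONE_LIFE_DANGER", "BTQ_WAR_ZONE_INJURED"],
)
```

If BTQ_WAR_ZONE is itself missing or 1, nothing is changed, so a missing follow-up stays missing in those cases.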
------------------------------------------------------------------------------------
How did you handle subjects who filled out both the first version of the COVID-19 survey ("COVID-19-previous") and
the version that was used for the majority of the study ("COVID-19-R")?
------------------------------------------------------------------------------------
For respondents who filled out both COVID-19-previous and COVID-19-R at baseline, we set the COVID-19-R form as
the respondents' "default" responses at baseline.
We then went through each question from COVID-19-R...
    - If the question was answered by the respondent (i.e., a response was NOT missing), we kept this value as the
      respondent's baseline COVID19 response.
    - If the question was NOT answered by the respondent in COVID-19-R, but WAS answered by the respondent in
      COVID-19-previous, we used the value from COVID-19-previous as the respondent's baseline COVID19 response
      for that question.
------------------------------------------------------------------------------------
How did you handle subjects who filled out the COVID-19-R survey twice at the end of the study?
------------------------------------------------------------------------------------
There were a number of subjects who submitted duplicated responses to the COVID-19-R survey at the final time point.
For these duplicate EOS responses, we selected the row corresponding to a response on the day closest to the rest of the EOS
responses ("primary_row") and the response not closest to the rest of EOS forms ("secondary_row"). Then, for each
question in primary_row, if the response is nan in primary_row but NOT nan in secondary_row, we set the response to
that question in primary_row to the corresponding response in secondary_row. We then used this revised primary_row
as the merged response for duplicate EOS submissions.
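
The merge of duplicate EOS submissions can be sketched as below. The way "closest to the rest of the EOS responses" is measured here (absolute distance in days from a single reference date), the DATE key, and the question name in the example are all assumptions for illustration; None stands in for nan.

```python
from datetime import date


def merge_duplicate_eos(rows, reference_day):
    """Merge two duplicate submissions into one row.

    rows: two dicts, each with a DATE key (hypothetical name) plus question fields.
    reference_day: date of the participant's other EOS forms (assumed to be known).
    """
    ordered = sorted(rows, key=lambda r: abs((r["DATE"] - reference_day).days))
    primary_row, secondary_row = ordered[0], ordered[1]
    merged = dict(primary_row)
    for question, value in primary_row.items():
        # Fill only the gaps in primary_row; answered questions are kept as-is.
        if value is None and secondary_row.get(question) is not None:
            merged[question] = secondary_row[question]
    return merged


first = {"DATE": date(2020, 10, 1), "Q_A": 2, "Q_B": 4}       # placeholder questions
second = {"DATE": date(2020, 10, 6), "Q_A": None, "Q_B": 1}
merged = merge_duplicate_eos([first, second], reference_day=date(2020, 10, 7))
```

In this example the second submission is closer to the reference date, so it becomes primary_row; its missing Q_A is filled from the first submission, and its own Q_B is kept.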
------------------------------------------------------------------------------------
Automated fixes:
------------------------------------------------------------------------------------

1) demographics

- mistakes are in one of the age fields, which is supposed to be numeric
 		"52-"
		"76 years old"
 		"55 7:56est"
		"29 years"
 		"55 years old"
		- 444
		and many more like this

2) COVID-19 R survey

- 6 participants completed both the original and revised survey at baseline. We manually removed duplicate original survey responses for: CV03297, CV03460, CV03520, CV03533, CV03618, CV03621

3) clinical_form_end_of_study (now these changes are made automatically inside the code)

- mistakes in height_inches field, 5%22 replaced by 5
- mistakes in weight_lbs,
     "About 154 (don't have a scale)" -> 154
     "160 lbs" -> 160
     "140 lbs" -> 140
     "168 lbs" -> 168
     "154 lbs- (6lb increase)" -> 154
     "190 lbs" -> 190
     "120 lbs" -> 120
     "165 lbs" -> 165
     "I don't know." -> ,,
     "~190 lbs" -> 190
     "~140" -> 140
     "130 estimate" -> 130
     "170 pounds" -> 170
     and many more like this
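
Most of the weight_lbs examples above amount to "take the first number in the text". A rough sketch of that idea follows; it is not the actual cleaning code, whose rules may be more specific, and the function name is made up. None stands in for an empty value.

```python
import re

_FIRST_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_weight_lbs(raw):
    """Return the first number found in a free-text weight, or None if there is none."""
    if raw is None:
        return None
    match = _FIRST_NUMBER.search(str(raw))
    return float(match.group()) if match else None


for text in ["About 154 (don't have a scale)", "154 lbs- (6lb increase)", "~190 lbs", "I don't know."]:
    print(repr(text), "->", parse_weight_lbs(text))
```

Note that this simple rule would not reproduce fixes that need interpretation, such as turning a range or a fraction of days into a single value.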

4) chronic_pain
- mistakes in PAIN_USUAL_ACTIVIY,
      "none" -> 0,
      "one" -> 1,
      "3 - IC related pain" -> 3,
      "Work is a good distraction so maybe 4." -> 4,
      "3-5" -> 4,
      "1/4 of days" -> 45,
      "One third" -> 60,
      "NEARLY EVERY DAY" -> 180,
      and many more like this


------------------------------------------------------------------------------------
End-of-study measures scoring conventions
------------------------------------------------------------------------------------
At the request of Jacob Shaw, Dr. Chung, etc., we added a number of summary measures for some of the end-of-study forms, detailed below.

BTQ
- BTQ_CRITERION_A_EVENT_EXPOSURE
  - Exposure to an event should be scored as positive if a respondent says yes to either:
      1. Life threat or serious injury for events 1-3 and 5-7;
      2. Life threat for event 4;
      3. Serious injury for event 8;
      4. “Has this ever happened to you?” for events 9 and 10.
- BTQ_N_TRAUMATIC_EVENTS: Summed score generated between 0-10 indicating number of events experienced that meet criterion of traumatic event
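
The BTQ scoring rules above can be sketched as follows. The input layout (a dict keyed by event number, with happened / life_danger / injured answers coded 0/1) is invented for the example, missing answers are treated as "criterion not met", and BTQ_CRITERION_A_EVENT_EXPOSURE is assumed to be 1 when at least one event qualifies. None of these choices is confirmed by this README.

```python
LIFE_THREAT_OR_INJURY_EVENTS = {1, 2, 3, 5, 6, 7}


def event_meets_criterion_a(number, answers):
    """Apply the per-event rule listed above to one BTQ event."""
    happened = answers.get("happened") == 1
    life_danger = answers.get("life_danger") == 1
    injured = answers.get("injured") == 1
    if number in LIFE_THREAT_OR_INJURY_EVENTS:
        return life_danger or injured
    if number == 4:
        return life_danger
    if number == 8:
        return injured
    if number in (9, 10):
        return happened
    raise ValueError(f"unknown BTQ event number: {number}")


def score_btq(events):
    """events: {event number: answers dict}. Returns the two summary fields."""
    n_events = sum(event_meets_criterion_a(n, a) for n, a in events.items())
    return {
        "BTQ_N_TRAUMATIC_EVENTS": n_events,
        "BTQ_CRITERION_A_EVENT_EXPOSURE": int(n_events > 0),
    }


scores = score_btq({
    1: {"happened": 1, "life_danger": 0, "injured": 1},
    4: {"happened": 1, "life_danger": 0, "injured": 1},
    8: {"happened": 1, "injured": 1},
    9: {"happened": 1},
    10: {"happened": 0},
})
```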

social_support
- SUM_SOCIAL_SUPPORT: sum of 8 Likert scales

emotional_support
- SUM_EMOTIONAL_SUPPORT: sum of 8 Likert scales

everyday_discrimination
- situation-based encoding for whether behavior occurred with any frequency at all. Recorded as EVER_*, where * is the discriminatory behavior variable.
  - E.g., EVER_LESS_COURTESY_OTHERS = 1 if LESS_COURTESY_OTHERS > 0.
          EVER_LESS_COURTESY_OTHERS = 0 if LESS_COURTESY_OTHERS == 0.
          EVER_LESS_COURTESY_OTHERS = nan if LESS_COURTESY_OTHERS == nan.
- frequency based coding: sum of scaled items recorded in SUM_EVERYDAY_DISCRIM column

personality
- Likert scale responses recoded as follows:
  - Answer: original code value -> new code value
    Disagree strongly: 1 -> -2
    Disagree a little: 2 -> -1
    Neither agree nor disagree: 3 -> 0
    Agree a little: 4 -> 1
    Agree strongly: 5 -> 2
- Also added BFI-10 scales for EXTRAVERSION, AGREEABLENESS, CONSCIENTIOUSNESS, NEUROTICISM, OPENNESS_EXPERIENCE
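
The Likert recoding in the table above is equivalent to subtracting 3 from the original code. A small sketch (the function name and the None-for-missing convention are assumptions):

```python
def recode_likert(value):
    """Map the original 1..5 codes onto -2..2; missing values stay missing."""
    if value is None:
        return None
    if value not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected Likert code: {value}")
    return value - 3


recoded = [recode_likert(v) for v in (1, 2, 3, 4, 5)]
```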

motivation
- Added variables for BAS drive (BAS_DRIVE), BAS fun seeking (BAS_FUN_SEEKING), BAS reward responsiveness (BAS_REWARD_RESPONSIVENESS) and BIS

BRS
- Likert scale responses recoded as in personality survey to the range [-2, 2]
- Average of scaled responses recorded as BRS_AVERAGE_RESILIENCE

chronic_pain
- PAIN_CHARACTERISTIC_INTENSITY: Characteristic pain intensity score – range 0-100, mean intensity ratings for current, worst, and average pain, then multiplied by 10
- PAIN_DISABILITY_SCORE: ranges from 0-100, mean rating for difficulty performing daily, social and work activities, then multiplied by 10
- DISABILITY_POINTS_SCORE: ranges from 0-6, combination of ranked categories of number of disability days and disability scores
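
The first two scores above share the same arithmetic: the mean of three ratings multiplied by 10, which gives the stated 0-100 range if each rating is on a 0-10 scale. A sketch, assuming that scale and assuming the score is left missing when any of the three ratings is missing (the function name is hypothetical):

```python
def mean_times_ten(ratings):
    """Mean of the given 0-10 ratings, multiplied by 10 (range 0-100)."""
    if any(r is None for r in ratings):
        return None  # assumption: no score if any rating is missing
    return sum(ratings) / len(ratings) * 10


# current, worst and average pain intensity
pain_characteristic_intensity = mean_times_ten([3, 8, 4])
# difficulty with daily, social and work activities
pain_disability_score = mean_times_ten([2, 5, 5])
```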

